Fisher-consistent loss functions play a fundamental role in the
construction of successful binary margin-based classifiers. In this pa- 
per we establish the Fisher-consistency condition for multicategory 
classification problems. Our approach uses the margin vector con- 
cept which can be regarded as a multicategory generalization of the 
binary margin. We characterize a wide class of smooth convex loss 
functions that are Fisher-consistent for multicategory classification. 
We then consider using the margin-vector-based loss functions to de- 
rive multicategory boosting algorithms. In particular, we derive two 
new multicategory boosting algorithms by using the exponential and 
logistic regression losses. 

1. Introduction. Margin-based classifiers, including the support
vector machine (SVM) [Vapnik (1996)] and boosting [Freund and Schapire
(1997)], have demonstrated excellent performance in binary classification
problems. Recent statistical theory regards binary margin-based classifiers
as regularized empirical risk minimizers with proper loss functions.
Friedman, Hastie and Tibshirani (2000) showed that AdaBoost minimizes
the novel exponential loss by fitting a forward stage-wise additive model. In
the same spirit, Lin (2002) showed that the SVM solves a penalized hinge loss
problem and that the population minimizer of the hinge loss is exactly the
Bayes rule; thus, the SVM directly approximates the Bayes rule without
estimating the conditional class probability. Furthermore, Lin (2004)
introduced the concept of Fisher-consistent loss in binary classification and
showed that any Fisher-consistent loss can be used to construct a binary
margin-based classifier. Buja, Stuetzle and Shen (2005) discussed the proper
scoring rules for binary classification and probability estimation, which are
closely related to the Fisher-consistent losses.



Received March 2008; revised August 12. 
Supported by NSF grant DMS 07-06-733. 

Key words and phrases. Boosting, Fisher-consistent losses, multicategory classification. 

This is an electronic reprint of the original article published by the
Institute of Mathematical Statistics in The Annals of Applied Statistics,
2008, Vol. 2, No. 4, 1290-1306. This reprint differs from the original in
pagination and typographic detail.






H. ZOU, J. ZHU AND T. HASTIE 



In the binary classification case, the Fisher-consistent loss function the- 
ory is often used to help us understand the successes of some margin-based 
classifiers, for the popular classifiers were proposed before the loss function 
theory. However, the important result in Lin (2004) suggests that it is possi- 
ble to go the other direction: we can first design a nice Fisher-consistent loss 
function and then derive the corresponding margin-based classifier. This 
viewpoint is particularly beneficial in the case of multicategory classifica- 
tion. There has been a considerable amount of work in the literature to 
extend the binary margin-based classifiers to the multi-category case. A 
widely used strategy for solving the multi-category classification problem is 
to employ the one-versus-all method [Allwein, Schapire and Singer (2000)], 
such that an m-class problem is reduced to m binary classification problems.
Rifkin and Klautau (2004) gave very provocative arguments to support the 
one-versus-all method. AdaBoost.MH [Schapire and Singer (1999)] is a
successful example of the one-versus-all approach, which solves an m-class
problem by applying AdaBoost to m binary classification problems. However, the
one-versus-all approach could perform poorly with the SVM if there is no 
dominating class, as shown by Lee, Lin and Wahba (2004). To fix this prob- 
lem, Lee, Lin and Wahba (2004) proposed the multicategory SVM. Their 
approach was further analyzed in Zhang (2004a). Liu and Shen (2006) and 
Liu, Shen and Doss (2005) proposed the multicategory psi-machine. 

In this paper we extend Lin's Fisher-consistency result to multicategory 
classification problems. We define the Fisher-consistent loss in the context 
of multicategory classification. Our approach is based on the margin vector, 
which is the multicategory generalization of the margin in binary classifi- 
cation. We then characterize a family of convex losses which are Fisher- 
consistent. With a multicategory Fisher-consistent loss function, one can 
produce a multicategory boosting algorithm by employing gradient descent
to minimize the empirical margin-vector-based loss. To demonstrate this
idea, we derive two new multicategory boosting algorithms. 

The rest of the paper is organized as follows. In Section 2 we briefly 
review binary margin-based classifiers. Section 3 contains the definition of 
multicategory Fisher-consistent losses. In Section 4 we characterize a class 
of convex multicategory Fisher-consistent losses. In Section 5 we introduce 
two new multicategory boosting algorithms that are tested on benchmark 
data sets. Technical proofs are relegated to the Appendix. 

2. Review of binary margin-based losses and classifiers. In standard 
classification problems we want to predict the label using a set of features. 
$y \in C$ is the label, where $C$ is a discrete set of size $m$, and $x$
denotes the feature vector. A classification rule $\delta$ is a mapping from
$x$ to $C$ such that a label $\delta(x)$ is assigned to the data point $x$.
Under the 0-1 loss, the misclassification error of $\delta$ is
$R(\delta) = P(y \neq \delta(x))$. The smallest classification error
is achieved by the Bayes rule $\arg\max_{c_i \in C} p(y = c_i|x)$. The
conditional class probabilities $p(y = c_i|x)$ are unknown, and so is the
Bayes rule. One must construct a classifier $\delta$ based on $n$ training
samples $(y_i, x_i)$, $i = 1, 2, \ldots, n$, which are independent identically
distributed (i.i.d.) samples from the underlying joint distribution $p(y, x)$.

In the book by Hastie, Tibshirani and Friedman (2001) readers can find 
detailed explanations of the support vector machine and boosting. Here we 
briefly discuss a unified statistical view of the binary margin-based
classifier. In the binary classification problem, $C$ is conveniently coded as
$\{1, -1\}$, which is important for the binary margin-based classifiers.
Consider a margin-based loss function $\phi(y, f) = \phi(yf)$, where the
quantity $yf$ is called the margin. We define the empirical $\phi$ risk as
$\mathrm{EMR}_n(\phi, f) = \frac{1}{n}\sum_{i=1}^n \phi(y_i f(x_i))$.
Then a binary margin-based $\phi$ classifier is obtained by solving

$$\hat f^{(n)} = \arg\min_{f \in \mathcal{F}_n} \mathrm{EMR}_n(\phi, f),$$

where $\mathcal{F}_n$ denotes a regularized functional space. The margin-based
classifier is $\mathrm{sign}(\hat f^{(n)}(x))$. For the SVM, $\phi$ is the
hinge loss and $\mathcal{F}_n$ is the collection of penalized kernel
estimators. AdaBoost amounts to using the exponential loss
$\phi(y, f) = \exp(-yf)$ and $\mathcal{F}_n$ is the space of decision trees.
The loss function plays a fundamental role in the margin-based classifica- 
tion. Friedman, Hastie and Tibshirani (2000) justified AdaBoost by show- 
ing that the population minimizer of the exponential loss is one-half the 
log-odds. Similarly, in the SVM case, Lin (2002) proved that the population 
minimizer of the hinge loss is exactly the Bayes rule. 

Lin (2004) further discussed a class of Fisher-consistent losses. A loss
function $\phi$ is said to be Fisher-consistent if

$$\hat f(x) = \arg\min_{f(x)} [\phi(f(x)) p(y = 1|x) + \phi(-f(x)) p(y = -1|x)]$$

has a unique solution $\hat f(x)$ and

$$\mathrm{sign}(\hat f(x)) = \mathrm{sign}(p(y = 1|x) - 1/2).$$

The Fisher-consistent condition basically says that with infinite samples,
one can exactly recover the Bayes rule by minimizing the $\phi$ loss.
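
As a small numerical illustration of this definition (a sketch, not part of the original development), the code below minimizes the population risk $\phi(f)p + \phi(-f)(1-p)$ for the exponential loss and compares the minimizer with one-half the log-odds. The function name `binary_population_minimizer`, the search interval and the value of `p` are assumptions made for the example.

```python
import math

def binary_population_minimizer(phi, p, lo=-20.0, hi=20.0, iters=200):
    """Minimize phi(f)*p + phi(-f)*(1-p) over f by ternary search.

    Assumes the risk is convex in f and that its minimizer lies in [lo, hi].
    """
    risk = lambda f: phi(f) * p + phi(-f) * (1.0 - p)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if risk(m1) < risk(m2):
            hi = m2          # the minimizer cannot lie to the right of m2
        else:
            lo = m1          # the minimizer cannot lie to the left of m1
    return 0.5 * (lo + hi)

exp_loss = lambda t: math.exp(-t)
p = 0.8                                   # p(y = 1 | x), chosen arbitrarily
f_hat = binary_population_minimizer(exp_loss, p)
half_log_odds = 0.5 * math.log(p / (1.0 - p))
```

With `p > 1/2` the computed `f_hat` is positive, so its sign agrees with the Bayes rule, as the Fisher-consistency condition requires.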

3. Multicategory Fisher-consistent losses. In this section we extend Lin's 
Fisher-consistent loss idea to the multicategory case. We let
$C = \{1, 2, \ldots, m\}$ ($m \geq 3$). From the definition of the binary
Fisher-consistent loss, we can regard the margin as an effective proxy for
the conditional class probability, if the decision boundary implied by the
"optimal" margin is identical to the Bayes decision boundary. To better
illustrate this interpretation of the margin, recall that
$\mathrm{sign}(p(y = 1|x) - 1/2)$ is the Bayes rule for binary classification
and

$$\mathrm{sign}(p(y = 1|x) - 1/2) = \mathrm{sign}(p(y = 1|x) - p(y = -1|x)),$$

$$\mathrm{sign}(\hat f(x)) = \mathrm{sign}(\hat f(x) - (-\hat f(x))).$$

The binary margin is defined as $yf$. Since $yf = f$ or $-f$, an equivalent
formulation is to assign margin $f$ to class 1 and margin $-f$ to class $-1$.
We regard $f$ as the proxy of $p(y = 1|x)$ and $-f$ as the proxy of
$p(y = -1|x)$, for
the purpose of comparison. Then the Fisher-consistent loss is nothing but 
an effective device to produce the margins that are a legitimate proxy of the 
conditional class probabilities, in the sense that the class with the largest 
conditional probability always has the largest margin. 

We show that the proxy interpretation of the margin offers a graceful
multicategory generalization of the margin. The multicategory margin, which
we call the margin vector, is conceptually identical to the binary margin. We
define the margin vector together with the multicategory Fisher-consistent
loss function.

Definition 1. An m-vector $f$ is said to be a margin vector if

$$\sum_{j=1}^m f_j = 0. \tag{3.1}$$

Suppose $\phi(\cdot)$ is a loss function and $f(x)$ is a margin vector for all
$x$. Let $p_j = p(y = j|x)$, $j = 1, 2, \ldots, m$, be the conditional class
probabilities and denote $p = (\cdots p_j \cdots)$. Then we define the
expected $\phi$ risk at $x$:

$$\phi(p, f(x)) = \sum_{j=1}^m \phi(f_j(x)) p(y = j|x). \tag{3.2}$$

Given $n$ i.i.d. samples, the empirical margin-vector based $\phi$ risk is
given by

$$\mathrm{EMR}_n(\phi) = \frac{1}{n}\sum_{i=1}^n \phi(f_{y_i}(x_i)). \tag{3.3}$$

A loss function $\phi(\cdot)$ is said to be Fisher-consistent for m-class
classification if $\forall x$ in a set of full measure, the following
optimization problem

$$\hat f(x) = \arg\min_{f(x)} \phi(p, f(x)) \quad \text{subject to} \quad \sum_{j=1}^m f_j(x) = 0 \tag{3.4}$$

has a unique solution $\hat f$, and

$$\arg\max_j \hat f_j(x) = \arg\max_j p(y = j|x). \tag{3.5}$$

Furthermore, a loss function $\phi$ is said to be universally Fisher-consistent
if $\phi$ is Fisher-consistent for m-class classification $\forall m \geq 2$.

We have several remarks.

Remark 1. We assign a margin $f_j$ to class $j$ as the proxy of the
conditional class probability $p(y = j|x)$. The margin vector satisfies the
sum-to-zero constraint such that when $m = 2$, the margin vector becomes the
usual binary margin. The sum-to-zero constraint also ensures the existence
and uniqueness of the solution to (3.4). The sum-to-zero constraint was also
used in Lee, Lin and Wahba (2004).

Remark 2. We do not need any special coding scheme for $y$ in our
approach, which is very different from the proposal in Lee, Lin and Wahba
(2004). The data point $(y_i, x_i)$ belongs to class $y_i$, hence, its margin
is $f_{y_i}(x_i)$ and its margin-based risk is $\phi(f_{y_i}(x_i))$. Thus, the
empirical risk is defined as that in (3.3). If we only know $x$, then $y$ can
be any class $j$ with probability $p(y = j|x)$, hence, we consider the
expected risk defined in (3.2).
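
To make the bookkeeping in Remark 2 concrete, here is a minimal sketch of the empirical risk (3.3): for each sample only the margin of its own class enters the loss. The helper name `empirical_margin_risk`, the 0-based class coding and the toy margin vectors are assumptions for illustration only.

```python
import math

def empirical_margin_risk(phi, margins, labels):
    """Empirical risk (3.3): the average of phi(f_{y_i}(x_i)).

    margins[i] is the margin vector f(x_i), a list of m numbers summing to 0;
    labels[i] is the class of sample i, coded 0..m-1 here for convenience.
    """
    total = 0.0
    for f_x, y in zip(margins, labels):
        assert abs(sum(f_x)) < 1e-9, "each margin vector must sum to zero"
        total += phi(f_x[y])          # only the margin of the true class is used
    return total / len(labels)

exp_loss = lambda t: math.exp(-t)
margins = [[1.0, -0.5, -0.5], [-1.0, 2.0, -1.0]]   # two toy margin vectors
labels = [0, 2]
risk = empirical_margin_risk(exp_loss, margins, labels)
```

The second sample has a negative margin for its true class, so it contributes $e^{1}$ to the sum while the first contributes $e^{-1}$.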

Remark 3. The Fisher-consistent condition is a direct generalization of 
the definition of the Fisher-consistent loss in binary classification. It serves 
the same purpose: to produce a margin vector that is a legitimate proxy of 
the conditional class probabilities such that comparing the margins leads to 
the multicategory Bayes rule. 

Remark 4. There are many nice Fisher-consistent loss functions for bi- 
nary classification. It would be interesting to check if these losses for binary 
classification are also Fisher-consistent for multicategory problems. This 
question will be investigated in Section 4, where we show that most of the
popular loss functions for binary classification are universally Fisher-consistent.

Remark 5. Buja, Stuetzle and Shen (2005) showed the connection between
Fisher-consistent losses and proper scoring rules, which estimate the
class probabilities in a Fisher-consistent manner. Of course, in classification
it is sufficient to estimate the Bayes rule consistently, so the
Fisher-consistent condition is weaker than proper scoring rules. However, we
show in the next section that many Fisher-consistent losses do provide
estimates of the class probabilities. Thus, they can be considered as the
multicategory proper scoring rules.

4. Convex multicategory Fisher-consistent losses. In this section we show 
that there are a number of Fisher-consistent loss functions for multicategory 
classification. In this work all loss functions are assumed to be non-negative. 
Without loss of generality, we assume $\arg\max_{c_i \in C} p(y = c_i|x)$ is
unique. We have the following sufficient condition for a differentiable convex
function to be universally Fisher-consistent.

Theorem 1. Let $\phi(t)$ be a twice differentiable loss function. If
$\phi'(0) < 0$ and $\phi''(t) > 0$ $\forall t$, then $\phi$ is universally
Fisher-consistent. Moreover, letting $\hat f$ be the solution of (3.4), then
we have

$$p(y = j|x) = \frac{1/\phi'(\hat f_j(x))}{\sum_{k=1}^m 1/\phi'(\hat f_k(x))}. \tag{4.1}$$



Theorem 1 immediately concludes that the two most popular smooth loss 
functions, namely, exponential loss and logistic regression loss (also called 
logit loss hereafter), are universally Fisher-consistent for multicategory clas- 
sification. The inversion formula (4.1) also shows that once the margin vec- 
tor is obtained, one can easily construct estimates for the conditional class 
probabilities. It is remarkable because we can not only do classification but 
also estimate the conditional class probabilities without using the likelihood 
approach. 
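
The inversion formula (4.1) is straightforward to evaluate once the derivative of the loss is available. The following sketch assumes a margin vector is given as a plain list; the function name `class_probabilities` and the example vector are illustrative choices, not notation from the paper.

```python
import math

def class_probabilities(dphi, f):
    """Inversion formula (4.1): p_j is proportional to 1 / phi'(f_j)."""
    inv = [1.0 / dphi(fj) for fj in f]
    total = sum(inv)
    return [v / total for v in inv]

# First derivatives of the exponential loss and of the logit loss.
dphi_exp = lambda t: -math.exp(-t)
dphi_logit = lambda t: -1.0 / (1.0 + math.exp(t))

f = [0.7, -0.2, -0.5]                     # an arbitrary margin vector
p_exp = class_probabilities(dphi_exp, f)
p_logit = class_probabilities(dphi_logit, f)
```

Both losses map the same margin vector to a probability vector whose largest entry sits at the class with the largest margin, although the probabilities themselves differ between the two losses.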

The conditions in Theorem 1 can be further relaxed without weakening
the conclusion. Supposing $\phi$ satisfies the conditions in Theorem 1, we can
consider the linearized version of $\phi$. Define the set $A$ as given in the
proof of Theorem 1 (see the Appendix) and let $t_1 = \inf A$. If $A$ is empty,
we let $t_1 = \infty$. Choosing a $t_2 < 0$, we define a new convex loss as
follows:

$$\zeta(t) = \begin{cases} \phi'(t_2)(t - t_2) + \phi(t_2), & \text{if } t \leq t_2, \\ \phi(t), & \text{if } t_2 < t < t_1, \\ \phi(t_1), & \text{if } t_1 \leq t. \end{cases}$$

As a modified version of $\phi$, $\zeta$ is a decreasing convex function and
approaches infinity linearly. We show that $\zeta$ is also universally
Fisher-consistent.

Theorem 2. $\zeta$ is universally Fisher-consistent and (4.1) holds for $\zeta$.

Theorem 2 covers the squared hinge loss and the modified Huber loss.
Thus, Theorems 1 and 2 imply that the popular smooth loss functions
used in binary classification are universally Fisher-consistent for
multicategory classification. In the remainder of this section we closely
examine these loss functions.



4.1. Exponential loss. We consider the case $\phi_1(t) = e^{-t}$,
$\phi_1'(t) = -e^{-t}$ and $\phi_1''(t) = e^{-t}$. By Theorem 1, we know that
the exponential loss is universally Fisher-consistent. In addition, the
inversion formula (4.1) in Theorem 1 tells us that

$$p_j = \frac{e^{\hat f_j}}{\sum_{k=1}^m e^{\hat f_k}}.$$

To express $\hat f$ by $p$, we write

$$\hat f_j = \log(p_j) + \log\left(\sum_{k=1}^m e^{\hat f_k}\right).$$

Since $\sum_{j=1}^m \hat f_j = 0$, we conclude that

$$0 = \sum_{j=1}^m \log(p_j) + m \log\left(\sum_{k=1}^m e^{\hat f_k}\right),$$

or equivalently,

$$\hat f_j = \log(p_j) - \frac{1}{m}\sum_{k=1}^m \log(p_k).$$

Thus, the exponential loss derives exactly the same estimates as the
multinomial deviance function.
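
The closed form above can be checked numerically. This sketch, under the assumption of an arbitrary three-class probability vector, computes the centered log-probabilities, maps them back through the inversion formula, and evaluates the stationarity quantity $\phi_1'(\hat f_j) p_j$, which should be the same for every class. The function names are hypothetical.

```python
import math

def exp_loss_margin(p):
    """Population minimizer for the exponential loss: centered log-probabilities."""
    logs = [math.log(pj) for pj in p]
    mean_log = sum(logs) / len(p)
    return [lj - mean_log for lj in logs]

def softmax(f):
    """Inversion formula (4.1) specialized to the exponential loss."""
    e = [math.exp(fj) for fj in f]
    s = sum(e)
    return [v / s for v in e]

p = [0.5, 0.3, 0.2]                       # assumed conditional class probabilities
f = exp_loss_margin(p)
p_back = softmax(f)
# phi'(f_j) * p_j = -exp(-f_j) * p_j should not depend on j at the minimizer.
stationarity = [-math.exp(-fj) * pj for fj, pj in zip(f, p)]
```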

4.2. Logit loss. The logit loss function is $\phi_2(t) = \log(1 + e^{-t})$,
which is essentially the negative binomial deviance. We compute
$\phi_2'(t) = -\frac{1}{1 + e^{t}}$ and
$\phi_2''(t) = \frac{e^{t}}{(1 + e^{t})^2}$. Then Theorem 1 says that the logit
loss is universally Fisher-consistent. By the inversion formula (4.1), we also
obtain

$$p_j = \frac{1 + e^{\hat f_j}}{\sum_{k=1}^m (1 + e^{\hat f_k})}.$$

To better appreciate formula (4.1), let us try to express the margin vector
in terms of the class probabilities. Let
$\lambda^* = \sum_{k=1}^m (1 + e^{\hat f_k})$. Then we have

$$\hat f_j = \log(-1 + p_j \lambda^*).$$

Note that $\sum_{j=1}^m \hat f_j = 0$, thus, $\lambda^*$ is the root of equation

$$\sum_{j=1}^m \log(-1 + p_j \lambda) = 0.$$

When $m = 2$, it is not hard to check that $\lambda^* = \frac{1}{p_1 p_2}$.
Hence, $\hat f_1 = \log(\frac{p_1}{p_2})$ and
$\hat f_2 = \log(\frac{p_2}{p_1})$, which are the familiar results for binary
classification. When $m > 2$, $\hat f$ depends on $p$ in a much more complex
way. But $p$ is always easily computed from the margin vector $\hat f$.
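
Although no closed form is available for $m > 2$, the root $\lambda^*$ is easy to find numerically because the left-hand side of the equation is increasing in $\lambda$. The sketch below uses plain bisection; the function name `logit_loss_margin`, the bracketing strategy and the example probabilities are assumptions, and all $p_j$ are taken to be strictly positive.

```python
import math

def logit_loss_margin(p, iters=200):
    """Solve sum_j log(p_j * lam - 1) = 0 for lam, then f_j = log(p_j * lam - 1)."""
    g = lambda lam: sum(math.log(pj * lam - 1.0) for pj in p)
    lo = 1.0 / min(p)          # g tends to -infinity as lam decreases to this value
    hi = 2.0 * lo              # here every term of g is (nearly) nonnegative
    while g(hi) < 0.0:         # guard against rounding: expand until bracketed
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [math.log(pj * lam - 1.0) for pj in p], lam

p = [0.5, 0.3, 0.2]                        # assumed conditional class probabilities
f, lam = logit_loss_margin(p)
# Map the margins back to probabilities with the inversion formula.
p_back = [(1.0 + math.exp(fj)) / lam for fj in f]
```

For two classes the same routine reproduces the log-odds, in agreement with the binary result stated above.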

The logit loss is quite unique, for it is essentially the negative (conditional) 
log-likelihood in the binary classification problem. In the multicategory prob- 
lem, from the likelihood point of view, the multinomial likelihood should be 
used, not the logit loss. From the viewpoint of the Fisher-consistent loss, the 






logit loss is also appropriate for the multicategory classification problem, be- 
cause it is universally Fisher-consistent. We later demonstrate the usefulness 
of the logit loss in multicategory classification by deriving a multicategory 
logit boosting algorithm. 

4.3. Least squares loss, squared hinge loss and modified Huber loss. The
least squares loss is $\phi_3(t) = (1 - t)^2$. We compute
$\phi_3'(t) = 2(t - 1)$ and $\phi_3''(t) = 2$. $\phi_3'(0) = -2$, hence, by
Theorem 1, the least squares loss is universally Fisher-consistent. Moreover,
the inversion formula (4.1) shows that

$$p_j = \frac{1/(1 - \hat f_j)}{\sum_{k=1}^m 1/(1 - \hat f_k)}.$$

We observe that $\hat f_j = 1 - (p_j \lambda^*)^{-1}$, where
$\lambda^* = \sum_{k=1}^m 1/(1 - \hat f_k)$. $\sum_{j=1}^m \hat f_j = 0$
implies that $\lambda^*$ is the root of equation
$\sum_{j=1}^m (1 - (\lambda p_j)^{-1}) = 0$. We solve
$\lambda^* = \frac{1}{m}\left(\sum_{j=1}^m 1/p_j\right)$. Thus,

$$\hat f_j = 1 - \frac{1}{p_j}\left(\frac{1}{m}\sum_{k=1}^m \frac{1}{p_k}\right)^{-1}.$$

When $m = 2$, we have the familiar result: $\hat f_1 = 2p_1 - 1$, by simply
using $1/p_1 + 1/p_2 = 1/(p_1 p_2)$. In multicategory problems the above
formula says that with the least squares loss, the margin vector is directly
linked to the inverse of the conditional class probability.
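
A short sketch of this closed form, with hypothetical function names, makes the link to the inverse probabilities explicit and checks the round trip through the inversion formula.

```python
def least_squares_margin(p):
    """Population minimizer for the least squares loss, given class probabilities p."""
    m = len(p)
    lam = sum(1.0 / pj for pj in p) / m          # lambda* = (1/m) * sum_j 1/p_j
    return [1.0 - 1.0 / (pj * lam) for pj in p]

def least_squares_probabilities(f):
    """Inversion formula (4.1) for the least squares loss."""
    inv = [1.0 / (1.0 - fj) for fj in f]
    s = sum(inv)
    return [v / s for v in inv]

p = [0.5, 0.3, 0.2]                               # assumed class probabilities
f = least_squares_margin(p)
p_back = least_squares_probabilities(f)
f_binary = least_squares_margin([0.7, 0.3])       # should give 2*0.7 - 1 for class 1
```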

We consider $\phi_4(t) = (1 - t)_+^2$, where "+" means the positive part.
$\phi_4$ is called the squared hinge loss. It can be seen as a linearized
version of least squares loss with $t_1 = 1$ and $t_2 = -\infty$. By Theorem
2, the squared hinge loss is universally Fisher-consistent. Furthermore, it is
interesting to note that the squared hinge loss shares the same population
minimizer with least squares loss.

Modified Huber loss is another linearized version of least squares loss with
$t_1 = 1$ and $t_2 = -1$, which is expressed as follows:

$$\phi_5(t) = \begin{cases} -4t, & \text{if } t \leq -1, \\ (t - 1)^2, & \text{if } -1 < t < 1, \\ 0, & \text{if } 1 \leq t. \end{cases}$$

By Theorem 2, we know modified Huber loss is universally Fisher-consistent.
The first derivative of $\phi_5$ is

$$\phi_5'(t) = \begin{cases} -4, & \text{if } t \leq -1, \\ 2(t - 1), & \text{if } -1 < t < 1, \\ 0, & \text{if } 1 \leq t, \end{cases}$$

which is used to convert the margin vector to the conditional class
probability.
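
The linearization behind the squared hinge and modified Huber losses can be written as a generic construction. In the sketch below the helper `linearize` and the evaluation grid are assumptions; it builds the piecewise loss from the least squares loss and compares it with the explicit modified Huber formula.

```python
def linearize(phi, dphi, t1, t2):
    """Linearized loss: linear to the left of t2, phi on (t2, t1), constant after t1."""
    def zeta(t):
        if t <= t2:
            return dphi(t2) * (t - t2) + phi(t2)
        if t < t1:
            return phi(t)
        return phi(t1)
    return zeta

least_squares = lambda t: (1.0 - t) ** 2
d_least_squares = lambda t: 2.0 * (t - 1.0)

def modified_huber(t):
    """Explicit form of the modified Huber loss."""
    if t <= -1.0:
        return -4.0 * t
    if t < 1.0:
        return (t - 1.0) ** 2
    return 0.0

zeta = linearize(least_squares, d_least_squares, t1=1.0, t2=-1.0)
grid = [-3.0 + 0.25 * k for k in range(25)]       # points from -3 to 3
max_gap = max(abs(zeta(t) - modified_huber(t)) for t in grid)

# With t2 = -infinity the linear piece is never used: the squared hinge loss.
squared_hinge = linearize(least_squares, d_least_squares, 1.0, float("-inf"))
```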
Algorithm 5.1 Multicategory GentleBoost

1. Start with $w_i = 1$, $i = 1, 2, \ldots, n$, $G_j(x) = 0$, $j = 1, \ldots, m$.

2. For $k = 1$ to $M$, repeat:

(a) For $j = 1$ to $m$, repeat:

i. Let $z_i = -1/m + I(y_i = j)$. Compute $w_i^* = w_i z_i^2$ and re-normalize.

ii. Fit the regression function $g_j(x)$ by weighted least-squares of
working response $z_i^{-1}$ to $x_i$ with weights $w_i^*$.

iii. Update $G_j(x) = G_j(x) + g_j(x)$.

(b) Compute $f_j(x) = G_j(x) - \frac{1}{m}\sum_{k=1}^m G_k(x)$.

(c) Compute $w_i = \exp(-f_{y_i}(x_i))$.

3. Output the classifier $\arg\max_j f_j(x)$.



5. Multicategory boosting algorithms. In this section we take advantage
of the multicategory Fisher-consistent loss functions to construct
multicategory classifiers that treat all classes simultaneously without
reducing the multicategory problem to a sequence of binary classification
problems. We follow Friedman, Hastie and Tibshirani (2000) and Friedman (2001)
to view boosting as a gradient descent algorithm that minimizes the
exponential loss. This view was also adopted by Bühlmann and Yu (2003) to
derive L2-boosting. For a nice overview of boosting, we refer the readers
to Bühlmann and Hothorn (2007). Borrowing the gradient descent idea, we
show that some new multicategory boosting algorithms naturally emerge
when using multicategory Fisher-consistent losses.

5.1. GentleBoost. Friedman, Hastie and Tibshirani (2000) proposed the 
binary Gentle AdaBoost algorithm to minimize the exponential loss by us- 
ing regression trees as base learners. In the same spirit we can derive the 
multicategory GentleBoost algorithm, as outlined in Algorithm 5.1. 

5.1.1. Derivation of GentleBoost. By the symmetry constraint on $f$, we
consider the following representation:

$$f_j(x) = G_j(x) - \frac{1}{m}\sum_{k=1}^m G_k(x) \quad \text{for } j = 1, \ldots, m. \tag{5.1}$$

No restriction is put on $G$. We write the empirical risk in terms of $G$:

$$\frac{1}{n}\sum_{i=1}^n \exp\left(-G_{y_i}(x_i) + \frac{1}{m}\sum_{k=1}^m G_k(x_i)\right) := L(G). \tag{5.2}$$

We want to find increments on $G$ such that the empirical risk decreases
most. Let $g(x)$ be the increments. Following the derivation of the Gentle
AdaBoost algorithm in Friedman, Hastie and Tibshirani (2000), we consider
Algorithm 5.2 AdaBoost.ML

1. Start with $f_j(x) = 0$, $j = 1, \ldots, m$.

2. For $k = 1$ to $M$:

(a) Compute weights $w_i = \frac{1}{1 + \exp(f_{y_i}(x_i))}$ and re-normalize.

(b) Fit an m-class classifier $T_k(x)$ to the training data using weights
$w_i$. Define

$$g_j(x) = \begin{cases} \sqrt{\frac{m - 1}{m}}, & \text{if } T_k(x) = j, \\ -\sqrt{\frac{1}{m(m - 1)}}, & \text{if } T_k(x) \neq j. \end{cases}$$

(c) Compute $\hat\gamma_k = \arg\min_\gamma \frac{1}{n}\sum_{i=1}^n \log(1 + \exp(-f_{y_i}(x_i) - \gamma g_{y_i}(x_i)))$.

(d) Update $f(x) \leftarrow f(x) + \hat\gamma_k g(x)$.

3. Output the classifier $\arg\max_j f_j(x)$.



the expansion of (5.2) to the second order and use a diagonal approximation
to the Hessian, then we obtain

$$L(G + g) \approx L(G) - \frac{1}{n}\sum_{i=1}^n \left(\sum_{k=1}^m g_k(x_i) z_{ik}\right) \exp(-f_{y_i}(x_i)) + \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(\sum_{k=1}^m g_k^2(x_i) z_{ik}^2\right) \exp(-f_{y_i}(x_i)),$$

where $z_{ik} = -1/m + I(y_i = k)$. For each $j$, we seek $g_j(x)$ that
minimizes

$$-\sum_{i=1}^n g_j(x_i) z_{ij} \exp(-f_{y_i}(x_i)) + \sum_{i=1}^n \frac{1}{2} g_j^2(x_i) z_{ij}^2 \exp(-f_{y_i}(x_i)).$$

A straightforward solution is to fit the regression function $g_j(x)$ by
weighted least-squares of $z_{ij}^{-1}$ to $x_i$ with weights
$z_{ij}^2 \exp(-f_{y_i}(x_i))$. Then $f$ is updated accordingly by (5.1). In
the implementation of the multicategory GentleBoost algorithm we use
regression trees to fit $g_j(x)$.
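
The steps of Algorithm 5.1 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the base learner here is a weighted least-squares regression stump (a deliberately simpler stand-in for the regression trees used in the paper), classes are coded 0..m-1, and the names `fit_stump` and `gentleboost` as well as the toy data are hypothetical.

```python
import math

def fit_stump(X, r, w):
    """Weighted least-squares regression stump; returns a function x -> prediction."""
    n, d = len(X), len(X[0])
    def wmean(idx):
        sw = sum(w[i] for i in idx)
        return sum(w[i] * r[i] for i in idx) / sw if sw > 0 else 0.0
    best = None                               # (sse, feature, threshold, left, right)
    for j in range(d):
        values = sorted(set(x[j] for x in X))
        for a, b in zip(values, values[1:]):
            thr = 0.5 * (a + b)
            left = [i for i in range(n) if X[i][j] <= thr]
            right = [i for i in range(n) if X[i][j] > thr]
            cl, cr = wmean(left), wmean(right)
            sse = sum(w[i] * (r[i] - (cl if X[i][j] <= thr else cr)) ** 2
                      for i in range(n))
            if best is None or sse < best[0]:
                best = (sse, j, thr, cl, cr)
    if best is None:                          # all inputs identical: fit a constant
        c = wmean(range(n))
        return lambda x: c
    _, j, thr, cl, cr = best
    return lambda x: cl if x[j] <= thr else cr

def gentleboost(X, y, m, n_rounds):
    """Multicategory GentleBoost with stumps; returns x -> margin vector f(x)."""
    n = len(X)
    w = [1.0] * n
    learners = [[] for _ in range(m)]         # learners[j] are summed to give G_j
    def margins(x):
        G = [sum(g(x) for g in learners[j]) for j in range(m)]
        mean_G = sum(G) / m
        return [Gj - mean_G for Gj in G]      # representation (5.1)
    for _ in range(n_rounds):
        for j in range(m):
            z = [(1.0 if y[i] == j else 0.0) - 1.0 / m for i in range(n)]
            w_star = [w[i] * z[i] ** 2 for i in range(n)]
            total = sum(w_star)
            w_star = [v / total for v in w_star]
            response = [1.0 / zi for zi in z]  # working response z_i^{-1}
            learners[j].append(fit_stump(X, response, w_star))
        w = [math.exp(-margins(X[i])[y[i]]) for i in range(n)]
    return margins

# Toy one-dimensional data with three well-separated classes (an assumption).
X = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
y = [0, 0, 1, 1, 2, 2]
f = gentleboost(X, y, m=3, n_rounds=1)
pred = [max(range(3), key=lambda j: f(x)[j]) for x in X]
```

On this toy problem a single boosting round already separates the three classes, and every returned margin vector sums to zero by construction of (5.1). Swapping in deeper regression trees would only change `fit_stump`.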

5.2. AdaBoost.ML. We propose a new logit boosting algorithm (Algorithm
5.2) by minimizing the binary logit risk. Similar to AdaBoost, the new
logit boosting algorithm aggregates multicategory decision trees; thus, we
call it AdaBoost.ML.

5.2.1. Derivation of AdaBoost.ML. We use the gradient descent algorithm
to find $\hat f(x)$ in the space of margin vectors to minimize

$$\mathrm{EER}_n(f) = \frac{1}{n}\sum_{i=1}^n \log(1 + \exp(-f_{y_i}(x_i))).$$



MULTICATEGORY BOOSTING AND FISHER-CONSISTENT LOSSES 






Supposing $f(x)$ is the current fit, the negative gradient of the empirical
logit risk is $\left(\frac{1}{1 + \exp(f_{y_i}(x_i))}\right)_{i=1,\ldots,n}$.
After normalization, we can take the negative gradient as
$(w_i)_{i=1,\ldots,n}$, the weights in 2(a).

Second, we find the optimal incremental direction $g(x)$, which is a
functional in the margin-vector space and best approximates the negative
gradient direction. Thus, we need to solve the following optimization problem:

$$\arg\max_g \sum_{i=1}^n w_i g_{y_i}(x_i) \quad \text{subject to} \quad \sum_{j=1}^m g_j = 0 \text{ and } \sum_{j=1}^m g_j^2 = 1. \tag{5.3}$$

On the other hand, we want to aggregate multicategory classifiers, thus,
the increment function $g(x)$ should be induced by an m-class classifier
$T(x)$. Consider a simple mapping from $T$ to $g$

$$g_j(x) = \begin{cases} a, & \text{if } j = T(x), \\ -b, & \text{if } j \neq T(x), \end{cases}$$

where $a > 0$ and $b > 0$. The motivation of using the above rule comes from
the proxy interpretation of the margin. The classifier $T$ predicts that class
$T(x)$ has the highest conditional class probability at $x$. Thus, we increase
the margin of class $T(x)$ by $a$ and decrease the margin of other classes by
$b$. The margin of the predicted class relatively gains $(a + b)$ against other
less favorable classes. We decrease the margins of the less favorable classes
simply to satisfy the sum-to-zero constraint. By the constraints in (5.3), we
have

$$0 = \sum_{j=1}^m g_j = a - (m - 1)b \quad \text{and} \quad 1 = \sum_{j=1}^m g_j^2 = a^2 + (m - 1)b^2.$$

Thus, $a = \sqrt{1 - 1/m}$ and $b = 1/\sqrt{m(m - 1)}$. Observe that

$$\sum_{i=1}^n w_i g_{y_i}(x_i) = \left(\sum_{i \in CC} w_i\right)\sqrt{1 - 1/m} - \left(\sum_{i \in NC} w_i\right) 1/\sqrt{m(m - 1)},$$

where

$$CC = \{i : y_i = T(x_i)\} \quad \text{and} \quad NC = \{i : y_i \neq T(x_i)\}.$$

Thus, we need to find a classifier $T$ to maximize $\sum_{i \in CC} w_i$,
which amounts to fitting a classifier $T(x)$ to the training data using
weights $w_i$. The fitted classifier $T(x)$ induces the incremental function
$\hat g(x)$.

Then for a given incremental direction $g(x)$, in 2(c) we compute the step
length by solving

$$\hat\gamma = \arg\min_\gamma \frac{1}{n}\sum_{i=1}^n \log(1 + \exp(-f_{y_i}(x_i) - \gamma g_{y_i}(x_i))).$$

The updated fit is $f(x) + \hat\gamma \hat g(x)$. The above procedure is
repeated $M$ times.
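
The building blocks of one AdaBoost.ML iteration can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the base classifier is replaced by a fixed list of predictions, the step length is found by a ternary search on a bounded interval (an assumption, since the minimizer is unbounded when every prediction is correct), and all function names are hypothetical.

```python
import math

def increment(pred_class, m):
    """Margin-vector increment g induced by a predicted class T(x)."""
    a = math.sqrt(1.0 - 1.0 / m)
    b = 1.0 / math.sqrt(m * (m - 1.0))
    return [a if j == pred_class else -b for j in range(m)]

def logit_weights(f_true):
    """Step 2(a): w_i proportional to 1 / (1 + exp(f_{y_i}(x_i))), normalized."""
    w = [1.0 / (1.0 + math.exp(v)) for v in f_true]
    s = sum(w)
    return [v / s for v in w]

def logit_risk(f_true, g_true, gamma):
    """Empirical logit risk after moving a step gamma along the increment."""
    return sum(math.log(1.0 + math.exp(-f - gamma * g))
               for f, g in zip(f_true, g_true)) / len(f_true)

def step_length(f_true, g_true, lo=0.0, hi=10.0, iters=200):
    """Step 2(c): ternary search for gamma on [lo, hi]; the risk is convex in gamma."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if logit_risk(f_true, g_true, m1) < logit_risk(f_true, g_true, m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

m = 3
y = [0, 1, 2, 0]                  # true classes, coded 0..m-1
T = [0, 1, 2, 1]                  # predictions of a stand-in classifier (one error)
f_true = [0.0] * len(y)           # f_{y_i}(x_i) at the initial fit f = 0
w = logit_weights(f_true)
g_true = [increment(t, m)[yi] for t, yi in zip(T, y)]
gamma = step_length(f_true, g_true)
```

Here three of the four samples gain margin `a` and the misclassified one loses `b`, so the search returns a positive step that lowers the empirical logit risk.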
Table 1
Data sets used in the experiments

Data                  No. Train   No. Test   Inputs   Classes   CART error
Waveform                    300       5000       21         3        31.6%
Vowel                       528        462       10        11        54.1%
Optdigits                  3823       1797       64        10        16.6%
Image segmentation          210       2100       19         7         9.8%
Pendigits                  7494       3498       16        10        8.32%




5.3. Some experiments with real-world data. Here we show the results 
of comparing the three multicategory boosting algorithms, AdaBoost.MH, 
GentleBoost and AdaBoost.ML, on several benchmark data sets obtained 
from the UCI machine learning repository [Newman, Hettich and Merz (1998)]. 
The number of boosting steps was 200 in all algorithms and examples. For 
reference, we also fit a single decision tree on each data set. The purpose 
of the experiments is to demonstrate the validity of our new multicategory 
boosting algorithms. 

We fixed the tree size in all four algorithms. Decision stumps are
commonly used as base learners in AdaBoost, and hence in AdaBoost.MH. In
AdaBoost.ML, we require each base learner $T_k$ to be a weak classifier for the
m-class problem (the accuracy of $T_k$ is better than 1/m). In the binary
classification case, two-node trees are generally sufficient for that purpose.
Similarly, we suggest using classification trees with (at least) m terminal
nodes in m-class problems. GentleBoost combines regression trees. The chosen
value for the number of terminal nodes (J) should reflect the level of dominant
interactions in $\hat f(x)$ [Hastie, Tibshirani and Friedman (2001)]. J = 2 is
often inadequate, and J > 10 is also very unlikely. Following the suggestion in
Hastie, Tibshirani and Friedman (2001), we used 8-node regression trees in
GentleBoost.

Table 1 summarizes these data sets and the test error rates using a single 
decision tree. Table 2 shows the test error rates. Figure 1 displays the test 
error curves of the four algorithms on waveform and vowel. The test-error 
curves of GentleBoost and AdaBoost.ML show the characteristic pattern 
of a boosting procedure: the test error steadily decreases as the boosting 
iterations proceed and then stays (almost) flat. These experiments clearly
show that the new algorithms work well and are very competitive with
AdaBoost.MH. GentleBoost seems to perform slightly better
than AdaBoost.MH.

We do not intend to argue that the new algorithms always outperform
AdaBoost.MH. In fact, AdaBoost.MH is asymptotically optimal [Zhang (2004a)];
thus, it is almost impossible to have a competitor that can always outperform
AdaBoost.MH. We are satisfied with the fact that our new multicategory
boosting algorithms can do as well as AdaBoost.MH and sometimes perform
slightly better than AdaBoost.MH. The working algorithms demonstrate the
usefulness of the multicategory Fisher-consistent loss functions.

[Figure 1; surviving panel title: waveform: 3 classes]

Fig. 1. Waveform and vowel data: test error rate as a function of boosting steps. To
better show the differences among the three algorithms, we start the plots from step 21 for
waveform data and step 11 for vowel data.

6. Conclusion. In this paper we have proposed the multicategory Fisher- 
consistent condition and characterized a family of convex losses that are 
universally Fisher-consistent for multicategory classification. To show the 
usefulness of the multicategory Fisher-consistent loss functions, we have 
also derived some new multicategory boosting algorithms by minimizing 
the empirical loss. These new algorithms have been empirically tested on 
several benchmark data sets. Fisher-consistency is the first step to
establish the Bayes risk consistency of the multicategory boosting algorithms
[Lin (2004), Zhang (2004a)]. It would be interesting to prove that
multicategory GentleBoost and AdaBoost.ML converge to the Bayes classifier in
terms of classification error. In future work we will follow Koltchinskii and
Panchenko (2002), Blanchard, Lugosi and Vayatis (2004), Lugosi and Vayatis
(2004), Bühlmann and Yu (2003) and Zhang (2004b) to study the convergence rate
of the proposed multicategory boosting algorithms.

APPENDIX: PROOFS 



Proof of Theorem 1. By definition of the Fisher-consistent loss, we
need to show that (3.4) has a unique solution and the condition (3.5) is
satisfied. Using the Lagrangian multiplier method, we define

$$L(f) = \phi(f_1) p_1 + \cdots + \phi(f_m) p_m + \lambda(f_1 + \cdots + f_m).$$



Table 2
Comparing GentleBoost and AdaBoost.ML with AdaBoost.MH. Inside (·) are the
standard errors of the test error rates

                      AdaBoost.MH   AdaBoost.ML   GentleBoost
Waveform                   18.22%        18.30%        17.74%
                          (0.55%)       (0.55%)       (0.54%)
Vowel                      50.87%        47.18%        45.67%
                          (7.07%)       (7.06%)       (7.04%)
Optdigits                   5.18%         5.40%         5.01%
                          (0.52%)       (0.53%)       (0.51%)
Image segmentation          5.29%         5.42%         5.38%
                          (0.48%)       (0.49%)       (0.49%)
Pendigits                   5.86%         4.09%         3.69%
                          (0.40%)       (0.33%)       (0.32%)

Then we have

$$\frac{\partial L(f)}{\partial f_j} = \phi'(f_j) p_j + \lambda = 0, \quad j = 1, \ldots, m. \tag{6.1}$$

$\phi''(t) > 0$ $\forall t$, hence, $\phi'$ has an inverse function, denoted by
$\psi$. Equation (6.1) gives $f_j = \psi(-\frac{\lambda}{p_j})$. By the
constraint on $f$, we have

$$\sum_{j=1}^m \psi\left(-\frac{\lambda}{p_j}\right) = 0. \tag{6.2}$$

$\phi'$ is a strictly monotonically increasing function, and so is $\psi$.
Thus, the left-hand side (LHS) of (6.2) is a decreasing function of $\lambda$.
It suffices to show that equation (6.2) has a root $\lambda^*$, which is the
unique root. Then it is easy to see that
$\hat f_j = \psi(-\frac{\lambda^*}{p_j})$ is the unique minimizer of (3.4),
for the Hessian matrix of $L(f)$ is a diagonal matrix and the jth diagonal
element is $\frac{\partial^2 L(f)}{\partial f_j^2} = \phi''(f_j) p_j > 0$.
Note that when $\lambda = -\phi'(0) > 0$, we have
$\frac{\lambda}{p_j} > -\phi'(0)$, then
$\psi(-\frac{\lambda}{p_j}) < \psi(\phi'(0)) = 0$. So the LHS of (6.2) is
negative when $\lambda = -\phi'(0) > 0$. On the other hand, let us define
$A = \{a : \phi'(a) = 0\}$. If $A$ is an empty set, then $\phi'(t) \to 0-$ as
$t \to \infty$ (since $\phi$ is a convex loss). If $A$ is not empty, denote
$a^* = \inf A$. By the fact $\phi'(0) < 0$, we conclude $a^* > 0$. Hence,
$\phi'(t) \to 0-$ as $t \to a^*-$. In both cases, we see that $\exists$ a
small enough $\lambda_0 > 0$ such that $\psi(-\frac{\lambda_0}{p_j}) > 0$
for all $j$. So the LHS of (6.2) is positive when $\lambda = \lambda_0 > 0$.
Therefore, there must be a positive $\lambda^* \in (\lambda_0, -\phi'(0))$
such that equation (6.2) holds.

For (3.5), let $p_1 > p_j$ $\forall j \neq 1$, then
$-\frac{\lambda^*}{p_1} > -\frac{\lambda^*}{p_j}$ $\forall j \neq 1$, so
$\hat f_1 > \hat f_j$ $\forall j \neq 1$.

Using (6.1), we get $p_j = -\frac{\lambda^*}{\phi'(\hat f_j)}$.
$\sum_{j=1}^m p_j = 1$ requires

$$\sum_{j=1}^m \left(-\frac{\lambda^*}{\phi'(\hat f_j)}\right) = 1.$$

So it follows that
$\lambda^* = -\left(\sum_{j=1}^m 1/\phi'(\hat f_j)\right)^{-1}$. Then (4.1)
is obtained. □

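Proof of Theorem 2. First, by the convexity of $\zeta$ and the fact
$\zeta \geq \phi(t_1)$, we know that the minimizer of (3.4) always exists. We
only need to show the uniqueness of the solution and (3.5). Without loss of
generality, let $p_1 > p_2 \geq p_3 \geq \cdots \geq p_{m-1} \geq p_m$.
Suppose $\hat f$ is a minimizer. Substituting
$f_m = -(\sum_{j=1}^{m-1} f_j)$ into (3.2), we have

$$\zeta(p, f) = \sum_{j=1}^m \zeta(f_j) p_j = \sum_{j=1}^{m-1} \zeta(f_j) p_j + \zeta\left(-\left(\sum_{j=1}^{m-1} f_j\right)\right) p_m. \tag{6.3}$$

Differentiating (6.3) yields

$$\zeta'(\hat f_j) p_j - \zeta'(\hat f_m) p_m = 0, \quad j = 1, 2, \ldots, m - 1,$$

or equivalently,

$$\zeta'(\hat f_j) p_j = -\lambda, \quad j = 1, 2, \ldots, m \text{ for some } \lambda. \tag{6.4}$$

There is one and only one such $\lambda$ satisfying (6.4). Otherwise, let
$\lambda_1 > \lambda_2$ and $\hat f(\lambda_1)$, $\hat f(\lambda_2)$ such that

$$\zeta'(\hat f_j(\lambda_1)) p_j = -\lambda_1, \quad \zeta'(\hat f_j(\lambda_2)) p_j = -\lambda_2 \quad \forall j.$$

Then we see that $\zeta'(\hat f_j(\lambda_1)) < \zeta'(\hat f_j(\lambda_2))$,
so $\hat f_j(\lambda_1) < \hat f_j(\lambda_2)$ for all $j$. This is clearly a
contradiction to the fact that both $\hat f(\lambda_1)$ and
$\hat f(\lambda_2)$ satisfy the constraint $\sum_{j=1}^m f_j = 0$.

Observe that if $0 > \zeta'(t) > \phi'(t_2)$, $\zeta'$ has an inverse denoted
as $\psi$. $\exists$ a small enough
$\lambda_0 : -\phi'(t_2) p_m > \lambda_0 > 0$ such that
$\psi(-\frac{\lambda_0}{p_j})$ exists and $\psi(-\frac{\lambda_0}{p_j}) > 0$
for all $j$. Thus, the $\lambda$ in (6.4) must be larger than $\lambda_0$.
Otherwise $\hat f_j \geq \psi(-\frac{\lambda_0}{p_j}) > 0$ for all $j$, which
clearly contradicts $\sum_{j=1}^m f_j = 0$. Furthermore,
$\zeta'(t) \geq \phi'(t_2)$ for all $t$, so
$\lambda \leq -\phi'(t_2) p_m$. Then let us consider the following two
situations:

Case 1. $\lambda \in (\lambda_0, -\phi'(t_2) p_m)$. Then
$\psi(-\frac{\lambda}{p_j})$ exists $\forall j$, and
$\hat f_j = \psi(-\frac{\lambda}{p_j})$ is the unique minimizer.

Case 2. $\lambda = -\phi'(t_2) p_m$. Similarly, for $j \leq (m - 1)$,
$\psi(-\frac{\lambda}{p_j})$ exists, and
$\hat f_j = \psi(-\frac{\lambda}{p_j})$. Then
$\hat f_m = -\sum_{j=1}^{m-1} \psi(-\frac{\lambda}{p_j})$.

Therefore, we prove the uniqueness of the minimizer $\hat f$. For (3.5), note
that
$\zeta'(\hat f_1) = -\frac{\lambda}{p_1} > -\frac{\lambda}{p_j} = \zeta'(\hat f_j)$
for $j \geq 2$, hence, we must have $\hat f_1 > \hat f_j$
$\forall j \geq 2$, due to the convexity of $\zeta$. The formula (4.1) follows
(6.4) and can be derived using the same arguments as in the proof of
Theorem 1. □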